Processing Rate Sensitivities
of a Heterogeneous Multiprocessor


Gordon  Lyon 

Key  words:  Application  signature;  architecture;  capacities; 
models;  performance;  sensitivities. 


Simple  performance  characterizations  of  multiprocessors  show  that  such  models, 
freed  of  specialized  detail,  apply  easily  and  widely.  They  can  convey  insight  about 
parallel efficiency [4] or scheduling [6]. Processing rate sensitivities are also amenable
to a simple approach. Consider a parallel program whose code is represented via
resource demands, $\alpha = \{\alpha_i\}$. The coefficients, which sum to unity, reflect how the
program's total computation divides among a system's disjoint computational modes.
Modes typically differentiate on type (say scalar or vector) or number (single vector unit
or chained) or both. Workload fractions for a vector program might be {0.3, 0.7}, for
demands of scalar-mode (1-processor) and vector-mode (1-processor), respectively [2].
Mode capacities determine a second set of coefficients, $r = \{r_i\}$. These are rates of

satisfying  demand.  A typical  set  of  scalar-mode/vector-mode  capacities  is  {10,  110}, 
measured  in  millions  of  floating-point  operations  per  second  (Mflops).  Program 
performance as an average processing rate $R$ is estimated by equating "time=time" [1, 2, 7]:

$$\frac{1}{R} = \frac{\alpha_1}{r_1} + \frac{\alpha_2}{r_2} + \cdots \qquad (1)$$

Combining the example coefficient sets, $R = [0.3/10 + 0.7/110]^{-1} = 27.5$ Mflops.
Coefficients for (1) also contribute, along with scheduling, to program rate sensitivities.
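As a minimal sketch of equation (1), the Python function below treats $R$ as a demand-weighted harmonic mean of the mode capacities and reproduces the scalar/vector example. The function name `processing_rate` is illustrative and does not come from the paper.

```python
def processing_rate(alpha, r):
    """Average rate R from equation (1): 1/R = sum(alpha_i / r_i)."""
    if abs(sum(alpha) - 1.0) > 1e-9:
        raise ValueError("demand coefficients must sum to 1")
    # Each term alpha_i / r_i is the relative time spent in mode i.
    return 1.0 / sum(a / c for a, c in zip(alpha, r))


# Scalar/vector example from the text: demands {0.3, 0.7}, capacities {10, 110} Mflops.
rate = processing_rate([0.3, 0.7], [10.0, 110.0])
print(f"R = {rate:.1f} Mflops")
```

Because the terms are times rather than rates, the slow scalar mode dominates: 30% of the work accounts for over 80% of the elapsed time in this example.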


Sensitivity  studies  of  capacity  on  otherwise  fixed  structures  (architectures)  are 
common  in  engineering  disciplines,  e.g.,  [8].  Analogous  programming  studies  emphasize 
where an application class (represented by signature $\alpha$) benefits from component
(capacity) changes on a host computer architecture. However, set $\alpha$ is not generally
independent of capacity changes, which can upset processor load balances. The
redistribution of $\alpha$ is modeled via best and worst cases of rescheduling, which together
establish accuracy limits of estimates [5]. While the net influence of workload
redistribution may be small for some systems, an object lesson can be made of an
n-processor MIMD shared-memory machine for which demand changes do matter. The context is
that  of  individually  exchanging  regular  processors  for  faster  ones  to  accelerate  Quicksort 
execution:  It  happens  that  upgrading  many  of  the  processors  may  give  a poor  return  on 
investment. 


Upgrading  k of  n Processors 


Imagine a multiprocessor with $n$ identical processor units. In this case, modes can
differentiate solely on the number of active processors. Let $i$ denote a mode of computing
with $i$ processors, the other $n-i$ being idle. Capacities $\{r_i\}$ then summarize average
system rates for multiprocessing levels, e.g., $r_{12}$ is the average collective rate for 12
active processors. The $r_i$ have been normalized such that

$$R_{\text{original}} = 1 \quad \text{(comparison basis)}, \qquad \text{constrained by} \quad \sum_{i=1}^{n} \alpha_i = 1 \quad \text{(full workload)}.$$


Suppose that $k$ of the $n$ processors can each be boosted to an improved processing
rate that is $1+\Delta$ times the original rate retained by all others. Overall improved
performance then hinges upon rebalancing disparities in processor loading. Two
example scheduling strategies are examined, one that uses the added power and another
that ignores it except for $\alpha_1$, the serial mode.


Scheduling I: Ideal Balancing within a Mode. One "best case" assumes a flexible
system with fine-grained load balancing characteristics. The system redistributes
computing within a mode with no noticeable overhead. This ensures that all $k$ improved
processors are engaged to the degree that an application permits, and is denoted $R_{(k)}$. A
shared-memory design might approach this ideal improvement, whereas a loosely-coupled
system probably cannot unless computation grain is coarse enough to mask
communication latencies. It is assumed that algorithmic constraints prevent coefficients
$\alpha_i$ from changing. However, any workload portion that runs with $j$ parallel processors,
$j \ge k$, nonetheless speeds up to the extent offered by $k$ faster processors. This assumption
appears in the new collective rate for mode $j$, which is $r_j(1+[k/j]\Delta)$. (Only $k$ of the $j$ are
improved.) It may require almost magical reassignments among faster and slower
processors to keep various portions from getting ahead and starving, which would
sacrifice some workload to a lesser mode. The following holds:
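Read together with equation (1), the boosted mode rate suggests the relation $1/R_{(k)} = \sum_j \alpha_j / (r_j(1+[\min(j,k)/j]\Delta))$. The sketch below assumes that form; the `min(j, k)` term is an assumption covering modes with fewer than $k$ active processors, all of which can then be faster units. The two-mode profile is hypothetical, chosen so that the unimproved rate equals 1.

```python
def ideal_rate(modes, alpha, r, k, delta):
    """Best-case rate R_(k): mode j engages min(j, k) of the k faster processors."""
    total_time = 0.0
    for j, a, c in zip(modes, alpha, r):
        boost = 1.0 + (min(j, k) / j) * delta  # collective speedup of mode j
        total_time += a / (c * boost)
    return 1.0 / total_time


# Hypothetical 4-processor profile: serial mode and full-parallel mode only.
modes, alpha, r = [1, 4], [0.2, 0.8], [0.4, 1.6]

unimproved = ideal_rate(modes, alpha, r, k=0, delta=2.0)
one_fast = ideal_rate(modes, alpha, r, k=1, delta=2.0)
all_fast = ideal_rate(modes, alpha, r, k=4, delta=2.0)
print(unimproved, one_fast, all_fast)
```

In this toy profile a single 3x processor already doubles the rate, because it triples the serial mode and still contributes half of its extra capacity per original processor to the four-processor mode.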

Scheduling II: Only Serial Speedup. Let $R_{(0)}$ denote the worst scheduling, which
systematically excludes fast processors whenever possible. An intuitive argument for
this redistribution predicts no improvement except when all $n$ processors have been
upgraded. Otherwise there is always some slow processor that delays completion. While
a more detailed argument that accounts for the coefficients is possible (see the
appendix), its predictions hardly differ in usual circumstances from the intuitive view.
Thus $R_{(0)} \approx R_{\text{original}}$ for $k$ of $n$ processors faster by $1+\Delta$.

Strategy $R_{(0)}$ is obviously deficient for reasonable cases. Improvement $R_{(1)}$ lets the
scheduler dispatch all serial workload (mode 1) on one of the $k$ faster processors. The effort to
do this is usually minimal, there being no other processes to interfere. An exception
would be very brief serial sections spread scattershot across all processors; here,
process relocation overhead might be high. The example in the next section does not
have this problem. It has around 20 serial epochs, each averaging 90 ms duration. $R_{(1)}$,
profitable because most programs have some region of serial bottleneck, is assumed
henceforth to be the worst case scheduling. This can be expressed as

$$\frac{1}{R_{(1)}} = \frac{1}{R_{(0)}} - \frac{\alpha_1}{r_1} + \frac{\alpha_1}{r_1(1+\Delta)} = \frac{1}{R_{(0)}} - \frac{\alpha_1 \Delta}{r_1(1+\Delta)}$$
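The serial-only relation can be sketched in a few lines. The function name and the numbers are hypothetical (a profile with $R_{(0)} = 1$, $\alpha_1 = 0.2$, $r_1 = 0.4$, so that the serial mode takes half the original time); they are chosen only to show how the gain saturates as $\Delta$ grows.

```python
def serial_only_rate(r_worst, alpha_1, r_1, delta):
    """R_(1): swap the mode-1 time alpha_1/r_1 for alpha_1/(r_1*(1+delta))."""
    inverse = 1.0 / r_worst - alpha_1 / r_1 + alpha_1 / (r_1 * (1.0 + delta))
    return 1.0 / inverse


# Hypothetical profile: serial work occupies 0.2/0.4 = half of the original time.
for delta in (2.0, 8.0, 1e9):
    print(delta, serial_only_rate(1.0, 0.2, 0.4, delta))
```

However fast the single processor becomes, the rate in this toy case cannot exceed 2, since the parallel half of the time is untouched. This is the usual serial-bottleneck argument applied in reverse.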


Example:  Actual  Sort  Workload 


Scheduling  performances  really  depend  upon  coefficient  sets.  For  this  example,  the 
machine  and  its  normalized  capacities  are  modeled  as  a shared  memory  system  without 
much  memory  or  bus  contention.  Each  added  processor  diminishes  overall  performance 
only a half percent, to 99.5% of what it would otherwise be. With 16 processors, this limits
efficiency to 92.8% of the sum of individual processors. A parallel Quicksort illustrates,
for data sets of 31000 values, some consequences of a divide-and-conquer paradigm. On
a 16-processor computer with shared memory, Quicksort demand coefficients $\alpha_i$ are
typically small except for the 16-processor mode, $\alpha_{16}$. All demand coefficients (column
3, below) are actual values measured via special low-perturbation methods [3,
esp. Fig. 4].

Number of Processors,    Normalized Capacity,    Resource Demand,
Mode i                   $r_i$                   $\alpha_i$ (20 trial ave.)

 1                       0.111                   0.026
 2                       0.222                   0.023
 4                       0.440                   0.036
 8                       0.862                   0.057
10                       1.067                   0.001
12                       1.268                   0.001
14                       1.465                   0.002
16                       1.657                   0.854

Capacity-and-Use Profile: Parallel Quicksort
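The tabulated capacities can be checked against the stated contention model. The sketch below is one reading of that model, not a formula given in the paper: it assumes $r_i = i \cdot u \cdot 0.995^{\,i-1}$, with the per-processor rate $u$ (a hypothetical name, `unit_rate` in the code) anchored so that $r_{16}$ matches the table. It also confirms that the profile is normalized to a rate close to 1.

```python
MODES = [1, 2, 4, 8, 10, 12, 14, 16]
R_TABLE = [0.111, 0.222, 0.440, 0.862, 1.067, 1.268, 1.465, 1.657]
ALPHA = [0.026, 0.023, 0.036, 0.057, 0.001, 0.001, 0.002, 0.854]
RETENTION = 0.995  # each added processor keeps 99.5% of the prior efficiency

# Efficiency with 16 processors: 15 processors added to the first one.
efficiency_16 = RETENTION ** 15

# Assumed model: r_i = i * unit_rate * RETENTION**(i - 1), anchored at r_16.
unit_rate = R_TABLE[-1] / (16 * efficiency_16)
r_model = [i * unit_rate * RETENTION ** (i - 1) for i in MODES]
max_gap = max(abs(m - t) for m, t in zip(r_model, R_TABLE))

# Equation (1) applied to the tabulated profile.
r_original = 1.0 / sum(a / c for a, c in zip(ALPHA, R_TABLE))

print(f"efficiency at 16 processors: {efficiency_16:.3f}")
print(f"largest model/table gap:     {max_gap:.4f}")
print(f"R_original from the table:   {r_original:.3f}")
```

Under this assumed model the tabulated capacities agree to about the third decimal place, and the small departure of the computed rate from 1 is consistent with three-digit rounding of the table entries.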


Estimates are made for improvement in one, two, and four of the processors.
Processor speedup ranges from three to nine. Best and worst scheduling cases ($\pm x$ in
table below) are established via $R_{(k)}$ and $R_{(1)}$, respectively. These scheduling tolerances
should be assessed relative to other sources of variation. For instance, demand fluctuates
from trial to trial, the greatest change being when workload is exchanged between
$\alpha_{16}$ and $\alpha_1$ (max. and min. processing rates). Since $\alpha_1 + \alpha_{16} = 0.88$ and $R_{\text{original}} = 1$
Looking at relative changes, $(\partial R/\partial \alpha_{16})(\alpha_{16}/R) = (8.4)(0.854/1) = 7.2$. Consequently,
even minor coefficient fluctuations among trials pose large performance uncertainties;
actual ranges of $\alpha_{16}$ over 20 trials were +3%, -2% about its mean. In this light,
scheduling tolerances are acceptable.
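The slope of 8.4 can be recovered from equation (1). The sketch below assumes that workload is exchanged only between modes 1 and 16, so that any increase in $\alpha_{16}$ is matched by an equal decrease in $\alpha_1$; this is an interpretation of the exchange described above, and the variable names are illustrative.

```python
alpha_16 = 0.854            # demand coefficient of the 16-processor mode
r_1, r_16 = 0.111, 1.657    # normalized capacities of modes 1 and 16
R = 1.0                     # normalized original rate

# With d(alpha_1) = -d(alpha_16): d(1/R)/d(alpha_16) = 1/r_16 - 1/r_1,
# and dR = -R**2 * d(1/R).
slope = R ** 2 * (1.0 / r_1 - 1.0 / r_16)

# Relative sensitivity: percent change in R per percent change in alpha_16.
elasticity = slope * alpha_16 / R

print(f"dR/d(alpha_16) = {slope:.2f}")
print(f"relative sensitivity = {elasticity:.2f}")
```

Shifting work from the slowest mode to the fastest therefore moves the rate several times faster, in relative terms, than the coefficient itself moves.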


16 Processors             Improved, Faster Processor
Improved : Base      3x ($\Delta=2$)     5x ($\Delta=4$)     9x ($\Delta=8$)

4 : 12               1.55 ± 24%          1.99 ± 38%          2.79 ± 55%
2 : 14               1.41 ± 16%          1.67 ± 27%          2.13 ± 41%
1 : 15               1.31 ± 10%          1.47 ± 17%          1.73 ± 27%

Relative Improvements, Sort Workload
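The table entries can be approximated from the Quicksort profile. The sketch below rests on two assumptions that the text does not state explicitly: the best case uses a boost of $1+[\min(j,k)/j]\Delta$ for mode $j$, and each entry is the midpoint of the worst and best relative rates, with the tolerance given as a percentage of that midpoint. Function and variable names are illustrative.

```python
MODES = [1, 2, 4, 8, 10, 12, 14, 16]
R_CAP = [0.111, 0.222, 0.440, 0.862, 1.067, 1.268, 1.465, 1.657]
ALPHA = [0.026, 0.023, 0.036, 0.057, 0.001, 0.001, 0.002, 0.854]

# Published (midpoint, tolerance %) keyed by (k improved processors, delta).
PUBLISHED = {
    (4, 2): (1.55, 24), (4, 4): (1.99, 38), (4, 8): (2.79, 55),
    (2, 2): (1.41, 16), (2, 4): (1.67, 27), (2, 8): (2.13, 41),
    (1, 2): (1.31, 10), (1, 4): (1.47, 17), (1, 8): (1.73, 27),
}


def rate(boosts):
    """Equation (1) with a per-mode speedup factor applied to each capacity."""
    return 1.0 / sum(a / (c * b) for a, c, b in zip(ALPHA, R_CAP, boosts))


def worst_rate(delta):
    """R_(1): only the serial mode runs on a faster processor."""
    return rate([1.0 + delta if j == 1 else 1.0 for j in MODES])


def best_rate(k, delta):
    """R_(k): mode j engages min(j, k) faster processors (assumed form)."""
    return rate([1.0 + (min(j, k) / j) * delta for j in MODES])


base = rate([1.0] * len(MODES))
estimates = {}
for (k, delta), published in PUBLISHED.items():
    low = worst_rate(delta) / base
    high = best_rate(k, delta) / base
    mid = (low + high) / 2.0
    tolerance = 100.0 * (high - low) / (2.0 * mid)
    estimates[(k, delta)] = (mid, tolerance)
    print(f"{k}:{16 - k}  {1 + delta}x  {mid:.2f} ± {tolerance:.0f}%  (published {published})")

# Worst-case gains, in percent, for 3x, 5x and 9x processors.
guaranteed = [round(100.0 * (worst_rate(d) / base - 1.0)) for d in (2, 4, 8)]
print(guaranteed)
```

Small differences from the published midpoints are to be expected, since the inputs here are the three-digit table values rather than the original measurements.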


Minimal performance gain for any configuration is determined by serial speedup in
the worst case scheduling, since $R_{(1)}$ assumes faster processors only help mode 1.
Consequently, each column has the same minimal performance. More than one faster
processor is generally no help. Substituting a 3x faster processor adds the equivalent of
2 processors, or 2/16 = 12.5%, to the original configuration. This plus alternate
substitutions of 5x and 9x in the 1:15 configuration (additions of 12.5, 25, and 50%
overall) produce guaranteed gains of 18, 23 and 26%, respectively. Clearly, an
economical improvement for minimal scheduling performance is to substitute one
moderately faster processor.

Maximum  performance  does  benefit  from  numerous  faster  processors.  Each  super-, 
main,  and  sub-diagonal  of  the  table  matrix  identifies  configurations  that  have  equivalent 
processor  power.  That  is,  four  processors  boosted  to  3x  (extra  capacity  of  2x  each) 
equals  two  processors  5x  faster  (extra  capacity  of  4x  each).  But  yield  along  a diagonal  is 
unequal for two reasons: (i) coefficients $\{\alpha_i\}$ are unequal, and (ii) $R_{(k)}$ does not change
demand coefficients and distribute work among more processors. A single faster
processor can (by earlier assumption) help serial or parallel modes, whereas $y$ processors
each improved to a lesser degree cannot help a mode $j$, $j < y$, as much.

The  modest  processing  rate  model  raises  issues  and  invites  discussion.  For  instance, 
the  example  schedulings  do  bias  performance  somewhat  toward  substitution  of  one  faster 
processor.  The  model  probably  exaggerates  the  efficacy  of  load  balancing  mechanisms 
for  shared-memory  machines.  An  opposing  factor  that  has  been  completely  ignored  is 
the nonlinear cost of upgrading processors. A 9x faster processor is likely significantly
more  expensive  than  two  5x  units,  this  very  fact  being  a strong  economic  motivation 
behind  parallel  architectures. 

Appendix: Lower Bound for Worst Case Scheduling


The "worst case", $R_{(0)}$, can be refined considerably via a lower bound. As before,
there are $k$ faster processors and $n-k$ of the originals. Let $P_j$ be a partition of the
workload actually handled by $j$ active processors, i.e., all demand serviced in mode $j$.
Argument begins with original workload partitions $P_1$ through $P_{n-k}$ unchanged and
assumes that these do not involve improved processors. (Remember that $R_{(0)}$ excludes
faster processors whenever it can.) Partition $P_{n-k}$ is special; it is the partition of largest
index that does not have to account directly for faster processors, although workload will
be added to it. In contrast, partitions $P_{n-k+1}$ to $P_n$ suffer degenerate circumstances that
cause parts of their workload to fall into other partitions (due to idle processors).
Because some processors handling workload in these latter partitions must be faster
versions (there are but $n-k$ originals), there will be gaps in available load. $R_{(0)}$ does not
try to rebalance these pauses, so there is idling. Consequently, some workload in original
partition $P_j$ no longer belongs there, but instead is assigned to a partition whose lesser
index signifies fewer active processors. For calculation convenience this residual portion
is clumped at the end of $P_j$'s processing, as if all improved processors idle
simultaneously. Computing for this residual is attributed to partition $P_{n-k}$. The time free
of idling can be expressed as $\alpha_j/(r_j(1+\Delta))$, i.e., one assumes that all $j$ processors run
faster and thereby completely shorten the partition processing time. In reality this is but
a convenience for calculation. It isolates that fraction of workload not processed in mode
$j$:

$$\alpha_j \left(\frac{\Delta}{1+\Delta}\right)\left(\frac{n-k}{j}\right)$$

The left set of parentheses expresses a residual fraction of dimension proportional to rate.
Faster processors of $P_j$ finish sooner, leaving this fraction for the slower processors to
complete in mode $n-k$. The right set of parentheses expresses another dimension, that
fraction of processors identified with another partition because they are not idle. These
are $n-k$ in number, all the slow processors, of the $j$ processors overall.

Assume that contention and other factors of marginal efficiency render
$r_i/i \ge r_{i+1}/(i+1)$. Then the worst possible per unit processing rate for the dropout load
from $P_j$ is that from the previous partition ($j-1$). This is scaled up for the $n-k$ processors of
$P_{n-k}$. Being pessimistic (for a lower bound), the processing rate for the partition dropout
is:

$$\frac{r_{j-1}}{j-1}\,(n-k), \quad \text{which is} \le r_{n-k}.$$


Terms of relative time for the higher-numbered partitions are now expressed as:

$$\alpha_j\left[\frac{1}{r_j(1+\Delta)} + \left(\frac{\Delta}{1+\Delta}\right)\left(\frac{n-k}{j}\right)\frac{j-1}{r_{j-1}(n-k)}\right]$$

This establishes a lower bound for the relative rate. Add terms for the first $n-k$ partitions
and invert.


$$R_{(0)} \ge \left[\sum_{i=1}^{n-k}\frac{\alpha_i}{r_i} + \sum_{j=n-k+1}^{n}\frac{\alpha_j}{1+\Delta}\cdot\frac{j\,r_{j-1} + (j-1)\,\Delta\,r_j}{j\,r_{j-1}\,r_j}\right]^{-1} \ge R_{\text{original}},$$

assuming that $r_i/i \ge r_{i+1}/(i+1)$ and $k < n$.

Note that whenever $r_i/i = r_{i+1}/(i+1)$ for any $1 \le i < n$ (scale-up with no processing
loss), then $R_{(0)} = R_{\text{original}}$, the intuitive "worst case" of the text. Also, whenever
$n = k$, there are no new opportunities for either strategy, since all processors are improved
models, and $R_{(0)} = R_{(k)}$. As a cautionary note, some conditions of $r_i/i > r_{i+1}/(i+1)$
determine that $R_{(0)} > R_{(k)}$.

Such is the curious efficacy of strategy $R_{(0)}$! It restricts amounts of parallelism.
Whenever added processors are too parasitic, they do not pay their way, and the
viewpoint of $R_{(0)}$ is productive.
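The lower bound can be exercised numerically. The sketch below is a hedged illustration: it assumes every mode 1..n is present (lists indexed by mode), writes each higher-partition term as $\frac{\alpha_j}{1+\Delta}\left(\frac{1}{r_j} + \frac{(j-1)\Delta}{j\,r_{j-1}}\right)$, and uses a hypothetical four-processor profile with either 10% contention per added processor or perfect scale-up. None of the names or numbers come from the paper.

```python
def original_rate(alpha, r):
    """Equation (1) for the unimproved machine."""
    return 1.0 / sum(a / c for a, c in zip(alpha, r))


def lower_bound_rate(alpha, r, k, delta):
    """Lower bound on R_(0); alpha[i-1] and r[i-1] describe mode i, with 0 < k < n."""
    n = len(alpha)
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    # Partitions P_1 .. P_(n-k) run unchanged on original processors.
    total = sum(alpha[i - 1] / r[i - 1] for i in range(1, n - k + 1))
    # Partitions P_(n-k+1) .. P_n: shortened time plus pessimistic dropout time.
    for j in range(n - k + 1, n + 1):
        dropout = (j - 1) * delta / (j * r[j - 2])  # r[j - 2] is r_(j-1)
        total += alpha[j - 1] / (1.0 + delta) * (1.0 / r[j - 1] + dropout)
    return 1.0 / total


alpha = [0.1, 0.1, 0.2, 0.6]
contended = [i * 0.25 * 0.9 ** (i - 1) for i in range(1, 5)]  # r_i / i decreasing
perfect = [i * 0.25 for i in range(1, 5)]                     # r_i / i constant

bound_contended = lower_bound_rate(alpha, contended, k=1, delta=2.0)
bound_perfect = lower_bound_rate(alpha, perfect, k=1, delta=2.0)
print(bound_contended, original_rate(alpha, contended))
print(bound_perfect, original_rate(alpha, perfect))
```

With perfect scale-up the bound collapses to the original rate, matching the intuitive worst case. With contention it rises slightly above it, because the dropout load is served in a less congested mode.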